AMENDMENTS TO THE CLAIMS
In accordance with the PTO's amendment format, a detailed listing of all 
claims has been provided. A status identifier is provided for each claim in 
parentheses following each claim number. Changes to the claims are shown by 
strikethrough or double bracketing (for deleted text) or underlining (for added text).

In the Claims: 

Claims 1, 2-6, 8-23, and 25-35 were previously pending. 

Claims 1 and 28 are amended. 

Claims 2, 8, 20-23, and 25-27 are canceled. 

New claims 36-39 are added. 

Claims 1, 3-6, 9-19, and 28-39 are pending.
Listing of the Claims

1. (Currently amended) A method of using a tuning set of
information to jointly optimize the performance and size of a language model,
comprising:

[[developing a language model from a tuning set of information;]]

segmenting at least a subset of a received textual corpus into segments by 
clustering every N-items of the received corpus into a training unit, wherein 
resultant training units are separated by gaps; 

creating the tuning set from application-specific information; 

(a) training a seed model via the tuning set;

(b) calculating [[a]] the similarity within a sequence of the training units
[[chunks]] on either side of each of the gaps;

(c) selecting segment boundaries that maximize intra-segment similarity
and inter-segment disparity; 

(d) calculating a perplexity value for each segment based on a
comparison with the seed model;

(e) selecting some of the segments based on their respective perplexity
values to augment the tuning set; 

iteratively refining the tuning set and the seed model by repeating steps (a)
through (e) until a threshold;

and 
refining the language model based on the seed model[[ with one or more
segments of the received corpus based, at least in part, on the calculated perplexity
value for the one or more segments]].

2. (Canceled) 

3. (Original) A method according to claim 1, wherein the tuning set 
of information is comprised of one or more application-specific documents. 

4. (Original) A method according to claim 1, wherein the tuning set 
of information is a highly accurate set of textual information linguistically relevant 
to, but not taken from, the received textual corpus.

5. (Original) A method according to claim 1, further comprising a 
training set comprised of at least the subset of the received textual corpus. 

6. (Original) A method according to claim 5, further comprising:
ranking the segments of the training set based, at least in part, on the
calculated perplexity value for each segment.

7. (Canceled)

8. (Canceled)

9. (Previously presented) A method according to claim 1, wherein 
N is an empirically derived value based, at least in part, on the size of the received 
corpus. 

10. (Previously presented) A method according to claim 1, wherein 
the calculation of the similarity within a sequence of training units defines a 
cohesion score. 

11. (Original) A method according to claim 10, wherein intra- 
segment similarity is measured by the cohesion score. 

12. (Previously presented) A method according to claim 10, 
wherein inter-segment disparity is approximated from the cohesion score. 

13. (Original) A method according to claim 1, wherein the 
calculation of inter-segment disparity defines a depth score. 
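
For illustration only, and not as part of the claim language, the cohesion and depth scores recited in claims 10-13 can be sketched as follows. The claims do not fix a formula, so this sketch assumes a cosine similarity over bag-of-words counts for the cohesion score and a peak-to-valley difference for the depth score; the function names, the `window` parameter, and the toy input are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def training_units(items, n):
    """Cluster every n items of the corpus into one training unit."""
    return [items[i:i + n] for i in range(0, len(items), n)]

def cohesion_scores(units, window=2):
    """Score each gap by the similarity of up to `window` units on either side."""
    scores = []
    for gap in range(1, len(units)):
        left = Counter(w for u in units[max(0, gap - window):gap] for w in u)
        right = Counter(w for u in units[gap:gap + window] for w in u)
        scores.append(cosine(left, right))
    return scores

def depth_scores(cohesion):
    """Depth of the cohesion valley at each gap relative to the nearest peaks."""
    depths = []
    for i, c in enumerate(cohesion):
        left = c
        for j in range(i - 1, -1, -1):      # climb to the peak on the left
            if cohesion[j] < left:
                break
            left = cohesion[j]
        right = c
        for j in range(i + 1, len(cohesion)):  # climb to the peak on the right
            if cohesion[j] < right:
                break
            right = cohesion[j]
        depths.append((left - c) + (right - c))
    return depths

units = training_units("a a a a b b b b".split(), 2)
cohesion = cohesion_scores(units)
depths = depth_scores(cohesion)
boundary_gap = depths.index(max(depths))  # index of the deepest gap
```

In this toy input the deepest gap falls between the second and third training units, where the vocabulary changes, which is where a segment boundary would be placed under these assumptions.
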
14. (Original) A method according to claim 1, wherein the perplexity
value is a measure of the predictive power of a certain language model to a 
segment of the received corpus. 
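
For illustration only, and not as part of the claim language, a perplexity value of the kind recited in claim 14 can be sketched as the exponentiated average negative log-probability that a model assigns to a segment. The unigram model, the add-one smoothing, and the names `train_unigram`, `perplexity`, and `vocab_size` are assumptions made for this sketch; the claims do not prescribe a model order or smoothing scheme.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return (counts, total) for a simple unigram model."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def perplexity(model, segment, vocab_size):
    """Perplexity of `segment` under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = 0.0
    for w in segment:
        p = (counts[w] + 1) / (total + vocab_size)  # add-one smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / len(segment))

model = train_unigram("a a a b".split())
in_domain = perplexity(model, ["a", "a"], vocab_size=3)
out_of_domain = perplexity(model, ["c", "c"], vocab_size=3)
```

A segment made of words the model has seen often receives a lower perplexity than a segment of unseen words, which is the property the ranking steps in the claims rely on.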

15. (Original) A method according to claim 1, further comprising:
ranking the segments of at least the subset of the received corpus based, at
least in part, on the calculated perplexity value of each segment; and

updating the tuning set of information with one or more of the segments 
from at least the subset of the received corpus. 

16. (Original) A method according to claim 15, wherein one or more 
of the segments with the lowest perplexity value from at least the subset of the 
received corpus are added to the tuning set. 

17. (Original) A method according to claim 1, further comprising:
utilizing the refined language model in an application to predict a
likelihood of another corpus.

18. (Original) A storage medium comprising a plurality of executable 
instructions including at least a subset of which, when executed, implement a 
method according to claim 1.
19. (Original) A system comprising:

a storage medium having stored therein a plurality of executable 
instructions; and 

an execution unit, coupled to the storage medium, to execute at least a
subset of the plurality of executable instructions to implement a method according 
to claim 1. 

20-27. (Canceled) 

28. (Currently amended) A modeling agent comprising: 
a controller, to receive invocation requests to develop a language model 
from a corpus; and 

a data structure generator, responsive to the controller, to: 

develop a seed language model from a tuning set of information; 

segment at least a subset of a received corpus, wherein[[:]] the 
segments of the received corpus are a clustering of every N items of the 
received corpus into a training unit[[;]] and the training units are separated 
by gaps; 

calculate the similarity within a sequence of training units on either 
side of each of the gaps; 

select segment boundaries that improve intra-segment similarity and 
inter-segment disparity; 
calculate a perplexity value for each segment; and

refine the seed language model with one or more segments of the
received corpus based, at least in part, on the calculated perplexity values;

iteratively refine the tuning set with segments ranked by the seed
model and in turn iteratively update the seed model via the refined tuning
set;

filter the received corpus via the seed model to find low-perplexity
segments; and

train the language model via the low-perplexity segments.

29. (Original) A modeling agent according to claim 28, wherein the 
tuning set is dynamically selected as relevant to the received corpus.

30. (Original) A modeling agent according to claim 28, the data 
structure generator comprising: 

a dynamic lexicon generation function, to develop an initial lexicon from 
the tuning set, and to update the lexicon with select segments from the received 
corpus. 
31. (Original) A modeling agent according to claim 28, the data
structure generator comprising: 

a frequency analysis function, to determine a frequency of occurrence of 
segments within the received corpus. 

32. (Original) A modeling agent according to claim 28, the data 
structure generator comprising: 

a dynamic segmentation function, to iteratively segment the received 
corpus to improve a predictive performance attribute of the modeling agent. 

33. (Original) A modeling agent according to claim 32, wherein the
dynamic segmentation function iteratively re-segments the received corpus until 
the language model reaches an acceptable threshold. 

34. (Original) A modeling agent according to claim 32, the data 
structure generator further comprising: 

a frequency analysis function, to determine a frequency of occurrence of 
segments within the received corpus.

35. (Original) A modeling agent according to claim 34, wherein the 
data structure generator selectively removes segments from the data structure that
do not meet a minimum frequency threshold, and dynamically re-segments the
received corpus to improve predictive capability while reducing the size of the
data structure.



36. (New) A method of jointly optimizing the performance and size of a 
language model, comprising: 

segmenting one or more relatively large language corpora into multiple segments 
of equal size; 

selecting an initial tuning sample of application-specific data, the initial tuning 
sample being relatively small in comparison to the one or more relatively large language 
corpora, wherein the initial tuning sample is used for training a seed model, the seed 
model to be used for ranking the multiple segments from the language corpora; 

iteratively training the seed model to obtain a mature seed model, wherein the 
iterative training proceeds until a threshold is reached, each iteration of the training 
including: 

updating the seed model according to the tuning sample; 
ranking each of the multiple segments according to a perplexity comparison 
with the seed model; 

selecting some of the multiple segments that possess a low perplexity; and 
augmenting the tuning sample with the selected segments; 

once the threshold is reached, filtering the language corpora according to the 
mature seed model to select low-perplexity segments; 

combining data from the low-perplexity segments; and 

training the language model according to the combined data. 
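
For illustration only, and not as part of the claim language, the iterative procedure recited in claim 36 can be sketched as follows. The sketch assumes an add-one-smoothed unigram model as a stand-in for the seed model, a fixed iteration budget as the threshold, and a fixed perplexity cutoff for the final filter. The names `select_training_data`, `seg_size`, `keep`, `max_iters`, and `cutoff` are hypothetical.

```python
import math
from collections import Counter

def train(tokens):
    """Unigram counts and total; a stand-in for training a model."""
    return Counter(tokens), len(tokens)

def perplexity(model, segment, vocab_size):
    """Perplexity of a segment under the add-one-smoothed unigram model."""
    counts, total = model
    logp = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in segment)
    return math.exp(-logp / len(segment))

def select_training_data(corpus, tuning, seg_size, keep, max_iters, cutoff):
    # Segment the corpus into segments of equal size (a short tail is dropped).
    segments = [corpus[i:i + seg_size]
                for i in range(0, len(corpus) - seg_size + 1, seg_size)]
    vocab_size = len(set(corpus) | set(tuning))
    tuning = list(tuning)
    used = set()
    for _ in range(max_iters):  # assumed threshold: a fixed iteration budget
        seed = train(tuning)    # update the seed model from the tuning sample
        ranked = sorted((i for i in range(len(segments)) if i not in used),
                        key=lambda i: perplexity(seed, segments[i], vocab_size))
        chosen = ranked[:keep]  # lowest-perplexity segments not yet used
        if not chosen:
            break
        for i in chosen:        # augment the tuning sample
            used.add(i)
            tuning.extend(segments[i])
    mature = train(tuning)
    # Filter the whole corpus with the mature seed model, then combine.
    combined = [w for seg in segments
                if perplexity(mature, seg, vocab_size) < cutoff
                for w in seg]
    return combined, train(combined)

corpus = "a b a b c d c d a a b b d d c c".split()
combined, language_model = select_training_data(
    corpus, tuning=["a", "b", "a", "b"],
    seg_size=4, keep=1, max_iters=2, cutoff=5.0)
```

With the toy corpus above, only the two segments that resemble the tuning sample survive the final filter, so the resulting language model is trained on a smaller and more application-specific data set than the full corpus.
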
37. (New) The method as recited in claim 36, wherein the selecting an
initial tuning sample comprises selecting a few application-specific documents. 

38. (New) The method as recited in claim 36, wherein the threshold 
comprises one of a predetermined size of the seed model or a sufficient application 
specificity of the seed model. 

39. (New) The method as recited in claim 36, further comprising
pruning the language model utilizing an entropy based cutoff algorithm that uses only
information embedded in the language model itself.